Abstract

We address the problem of automatically learning the main steps to
complete a certain task, such as changing a car tire, from a set of
narrated instruction videos. The contributions of this paper are
three-fold. First, we develop a new unsupervised learning approach that
takes advantage of the complementary nature of the input video and the
associated narration. The method solves two clustering problems, one in
text and one in video, applied one after the other and linked by joint
constraints to obtain a single coherent sequence of steps in both
modalities. Second, we collect and annotate a new challenging dataset of
real-world instruction videos from the Internet. The dataset contains
about 800,000 frames for five different tasks¹ that include complex
interactions between people and objects, and are captured in a variety
of indoor and outdoor settings. Third, we experimentally demonstrate
that the proposed method can automatically discover, in an unsupervised
manner, the main steps to achieve the task and locate the steps in the
input videos.


1. Introduction 

Millions of people watch narrated instruction videos² to learn new tasks
such as assembling IKEA furniture or changing a flat car tire. Many such
tasks have a large number of videos available online. For example,
querying for “how to change a tire” returns more than 300,000 hits on
YouTube. Most of these videos, however, are made with the intention of
teaching other people to perform the task and do not provide a direct
supervisory signal for automatic learning algorithms. Developing
unsupervised methods that could learn tasks from the myriad instruction
videos on the Internet is therefore a key challenge. Such automatic
cognitive ability would enable constructing virtual assistants and smart
robots that learn new skills from the Internet to, for example, help
people achieve new tasks in unfamiliar situations.

1 How to: change a car tire, perform cardiopulmonary resuscitation
(CPR), jump a car, repot a plant and make coffee.

2 Some instruction videos on YouTube have tens of millions of views,
e.g. www.youtube.com/watch?v=J4-GRH2nDvw.

In this work, we consider instruction videos and develop a method that
learns a sequence of steps, as well as their textual and visual
representations, required to achieve a certain task. For example, given
a set of narrated instruction videos demonstrating how to change a car
tire, our method automatically discovers consecutive steps for this task
such as loosen the nuts of the wheel, jack up the car, remove the spare
tire and so on, as illustrated in Figure 1. In addition, the method
learns the visual and linguistic variability of these steps from natural
videos.

Discovering key steps from instruction videos is a highly challenging
task. First, linguistic expressions for the same step can vary widely
across videos, for example: “...Loosen up the wheel nut just a little
before you start jacking the car...” and “...Start to loosen the lug
nuts just enough to make them easy to turn by hand...”. Second, the
visual appearance of each step varies greatly between videos as the
people and objects are different, the action is captured from a
different viewpoint, and the way people perform actions also varies.
Finally, the overall structure of the sequence of steps achieving the
task also varies. For example, some videos may omit some steps or
slightly change their order.

To address these challenges, in this paper we develop an unsupervised
learning approach that takes advantage of the complementarity of the
visual signal in the video and the corresponding natural language
narration to resolve their ambiguities. We assume that the same ordered
sequence of steps (also called script in the NLP literature [26]) is
common to all input videos of the same task, but the actual sequence and
the individual steps are unknown and are learnt directly from data. This
is in contrast to other existing methods for modeling instruction videos
[19] that assume a script (recipe) is known and fixed in advance. We
address the problem by first performing temporal clustering of text
followed by clustering in video, where the two clustering tasks are
linked by joint constraints. The complementary nature of the two
clustering problems helps to resolve ambiguities in the two individual
modalities. For example, two video segments with very different
appearance but depicting the same step can be grouped together because
they are narrated in similar language. Conversely, two video segments
described with very different expressions, for example, “jack up the
car” and “raise the vehicle”, can be identified as belonging to the same
instruction step because they have similar visual appearance. The output
of our method is the script listing the discovered steps of the task as
well as the temporal location of each step in the input videos. We
validate our method on a new dataset of instruction videos composed of
five different tasks with a total of 150 videos and about 800,000
frames.

[Figure 1 narration excerpts shown with the frames: “Start by loosening
each bolt. Then locate the jack and lift the car. Now you can remove the
bolts and then the wheel.” and “First undo the nuts. Once that done, you
can jack the car. Then withdraw the nuts completely so that you can
remove the flat tire.”]

Figure 1: Given a set of narrated instruction videos demonstrating a
particular task, we wish to automatically discover the main steps to
achieve the task and associate each step with its corresponding
narration and appearance in each video. Here frames from two videos
demonstrating changing the car tire are shown, together with excerpts of
the corresponding narrations. Note the large variations in both the
narration and appearance of the different steps highlighted by the same
colors in both videos (here only three steps are shown).

2. Related work 

This work relates to unsupervised and weakly-supervised learning methods
in computer vision and natural language processing. Particularly related
to ours is the work on learning script-like knowledge from natural
language descriptions [6, 11, 26]. These methods aim to discover typical
events (steps) and their order for particular scenarios (tasks)³ such as
“cooking scrambled egg”, “taking a bus” or “making coffee”. While [6]
uses large-scale news corpora, [26] argues that many events are implicit
and are not described in such general-purpose text data. Instead,
[11, 26] use event sequence descriptions collected for particular
scenarios. In contrast to this line of work, we learn sequences of
events from narrated instruction videos on the Internet. Such data
contains detailed event descriptions but is not structured and contains
more noise compared to the input of [11, 26].

Interpretation of narrated instruction videos has been recently
addressed in [19]. While this work analyses cooking videos at large
scale, it relies on readily-available recipes which may not be available
for more general scenarios. Differently from [19], we here aim to learn
the steps of instruction videos using a discriminative clustering
approach. A similar task to ours is addressed in [21] using a
latent-variable structured perceptron algorithm to align nouns in
instruction sentences with objects touched by hands in instruction
videos. However, similarly to [19], [21] uses laboratory experimental
protocols as textual input, whereas here we consider a weaker signal in
the form of the real transcribed narration of the video.

3 We here assign the same meaning to terms “event” and “step” as well
as to terms “script” and “task”.

In computer vision, unsupervised action recognition has been explored in
simple videos [23]. More recently, weakly supervised learning of actions
in video using video scripts or event order has been addressed in
[3, 4, 5, 9, 16]. Particularly related to ours is the work [4] which
exploits the known order of events to localize and learn actions in
training data. While [4] uses manually annotated sequences of events, we
here discover the sequences of main events by clustering transcribed
narrations of the videos. Also related is the work of [5], which aligns
natural text descriptions to video but, in contrast to our approach,
does not automatically discover the common sequence of main steps.
Methods in [22, 25] learn in an unsupervised manner the temporal
structure of actions from video but do not discover textual expressions
for actions as we do in this work. The recent concurrent work [27]
addresses, independently of our work, a similar problem but with a
different approach based on a probabilistic generative model and
considering a different set of tasks mainly focussed on cooking
activities.

Our work is also related to video summarization and in particular to the
recent work on category-specific video summarization [24, 29]. While
summarization is a subjective task, we here aim to extract the key steps
required to achieve a concrete task that consistently appear in the same
sequence in the input set of videos. In addition, unlike video
summarization [24, 29] we jointly exploit visual and linguistic
modalities in our approach.

3. New dataset of instruction videos 

We have collected a dataset of narrated instruction videos for five
tasks: Making a coffee, Changing car tire, Performing cardiopulmonary
resuscitation (CPR), Jumping a car and Repotting a plant. The videos
were obtained by searching YouTube with relevant keywords. The five
tasks were chosen so that they have a large number of available videos
with English transcripts while trying to cover a wide range of
activities that include complex interactions of people with objects and
other people. For each task, we took the top 30 videos with English ASR
returned by YouTube. We also quickly verified that each video contains a
person actually performing the task (as opposed to just talking about
it). The result is a total of 150 videos, 30 videos for each task. The
average length of our videos is about 4,000 frames (or 2 minutes) and
the entire dataset contains about 800,000 frames.

The selected videos have English transcripts obtained from YouTube’s
automatic speech recognition (ASR) system. To remove the dependence of
results on errors of the particular ASR method, we have manually
corrected misspellings and punctuation in the output transcriptions. We
believe this step will soon become obsolete given rapid improvements of
ASR methods. As we do not modify the content of the spoken language in
videos, the transcribed verbal instructions still represent an extremely
challenging example of natural language with large variability in the
expressions and terminology used. Each word of the transcript is
associated with a time interval in the video (usually less than 5
seconds) obtained from the closed caption timings.

For the purpose of evaluation, we have manually annotated the temporal
location in each video of the main steps necessary to achieve the given
task. For all tasks, we have defined the ordered sequence of ground
truth steps before running our algorithm. The choice of steps was made
by an agreement of 2-3 annotators who have watched the input videos and
verified the steps on instruction video websites such as
http://www.howdini.com. While some steps can be occasionally left out in
some videos or the ordering slightly modified, overall we have observed
a good consistency in the given sequence of instructions among the input
videos. We measured that only 6% of the step annotations did not fit the
global order, while a step was missing from the video 27% of the time.⁴
We hypothesize that this consistency could be attributed to the fact
that all videos are made with the same goal of giving other humans
clear, concise and comprehensible verbal and visual instructions on how
to achieve the given task. Given the list of steps for each task, we
have manually assigned each time interval in each input video to one of
the ground truth steps (or no step). The actions of the individual steps
are typically separated by hundreds of frames where the narrator
transitions between the steps or explains verbally what is going to
happen. Furthermore, some steps could be missing in some videos, or
could be present but not described in the narration. Finally, the
temporal alignment between the narration and the actual actions in video
is only coarse as the action is often described before it is performed.

4 We describe these measurements in more detail in the supplementary
material given in Appendix A.1.


4. Modelling narrated instruction videos 

We are given a set of $N$ instruction videos all depicting the same task
(such as “changing a tire”). The $n$-th input video is composed of a
video stream of $T_n$ segments of frames $(x^n_t)_{t=1}^{T_n}$ and an
audio stream containing a detailed verbal description of the depicted
task. We suppose that the audio description was transcribed to raw text
and then processed to a sequence of $S_n$ text tokens
$(d^n_s)_{s=1}^{S_n}$. Given this data, we want to automatically recover
the sequence of $K$ main steps that compose the given task and locate
each step within each input video and text transcription.

We formulate the problem as two clustering tasks, one in text and one in
video, applied one after the other and linked by joint constraints
between the two modalities. This two-stage approach is based on the
intuition that the variation in natural language describing each task is
easier to capture than the visual variability of the input videos. In
the first stage, we cluster the text transcripts into a sequence of $K$
main steps to complete the given task. Empirically, we have found (see
results in Sec. 5.1) that it is possible to discover the sequence of the
$K$ main steps for each task with high precision. However, the text
itself gives only a poor localization of each step in each video.
Therefore, in the second stage we accurately localize each step in each
video by clustering the input videos using the sequence of $K$ steps
extracted from text as constraints on the video clustering. To achieve
this, we use two types of constraints between video and text. First, we
assume that both the video and the text narration follow the same
sequence of steps. This results in a global ordering constraint on the
recovered clustering. Second, we assume that people perform the action
approximately at the same time that they talk about it. This constraint
temporally links the recovered clusters in text and video. The important
outcome of the video clustering stage is that the $K$ extracted steps
get propagated by visual similarity to videos where the text
descriptions are missing or ambiguous.

We first describe the text clustering in Sec. 4.1 and then introduce the
video clustering with constraints in Sec. 4.2.

Figure 2: Clustering transcribed verbal instructions. Left: The input
raw text for each video is converted into a sequence of direct object
relations. Here, an illustration of four sequences from four different
videos is shown. Middle: Multiple sequence alignment is used to align
all sequences together. Note that different direct object relations are
aligned together as long as they have the same sense, e.g. “loosen nut”
and “undo bolt”. Right: The main instruction steps are extracted as the
$K = 3$ most common steps in all the sequences.


4.1. Clustering transcribed verbal instructions 

The goal here is to cluster the transcribed verbal descriptions of each
video into a sequence of main steps necessary to achieve the task. This
stage is important as the resulting clusters will be used as constraints
for jointly learning and localizing the main steps in video. We assume
that the important steps are common to many of the transcripts and that
the sequence of steps is (roughly) preserved in all transcripts. Hence,
following [26], we formulate the problem of clustering the input
transcripts as a multiple sequence alignment problem. However, in
contrast to [26] who cluster manually provided descriptions of each
step, we wish to cluster transcribed verbal instructions. Hence our main
challenge is to deal with the variability in spoken natural language. To
overcome this challenge, we take advantage of the fact that completing a
certain task usually involves interactions with objects or people and
hence we can extract a more structured representation from the input
text stream.

More specifically, we represent the textual data as a sequence of direct
object relations. A direct object relation $d$ is a pair composed of a
verb and its direct object complement, such as “remove tire”. Such a
direct object relation can be extracted from the output of a dependency
parser applied to the input transcribed narration [8]. We denote the set
of all different direct object relations extracted from all narrations
as $\mathcal{D}$, with cardinality $D$. For the $n$-th video, we thus
represent the text signal as a sequence of direct object relation
tokens: $d^n = (d^n_1, \ldots, d^n_{S_n})$, where the length $S_n$ of
the sequence varies from one video clip to another. This step is key to
the success of our method as it allows us to convert the problem of
clustering raw transcribed text into an easier problem of clustering
sequences of direct object relations. The goal is now to extract from
the narrations the most common sequence of $K$ main steps to achieve the
given task. To achieve this, we first find a globally consistent
alignment of the direct object relations that compose all text sequences
by solving a multiple sequence alignment problem. Second, we pick from
this alignment the $K$ most globally consistent clusters across videos.

Multiple sequence alignment model. We formulate the first stage of
finding the common alignment between the input sequences of direct
object relations as a multiple sequence alignment problem with the
sum-of-pairs score [31]. In detail, a global alignment can be defined by
re-mapping each input sequence $d^n$ of tokens to a global common
template of $L$ slots, for $L$ large enough. We let
$(\phi(d^n)_l)_{1 \le l \le L}$ represent the (increasing) re-mapping
for sequence $d^n$ at the new locations indexed by $l$: $\phi(d^n)_l$
represents the direct object relation put at location $l$, with
$\phi(d^n)_l = \emptyset$ if a slot is left empty (denoting the
insertion of a gap in the original sequence of tokens). See the middle
of Figure 2 for an example of re-mapping. The goal is then to find a
global alignment that minimizes the following sum-of-pairs cost
function:

$$\sum_{(n,m)} \sum_{l=1}^{L} c\big(\phi(d^n)_l, \phi(d^m)_l\big), \qquad (1)$$

where $c(d_1, d_2)$ denotes the cost of aligning the direct object
relations $d_1$ and $d_2$ at the same common slot $l$ in the global
template. The above cost thus denotes the sum of all pairwise alignments
of the individual sequences (the outer sum), where the quality of each
alignment is measured by summing the cost $c$ of matches of individual
direct object relations mapped into the common template sequence. We use
a negative cost when $d_1$ and $d_2$ are similar according to the
distance in the WordNet tree [10, 20] of their verb and direct object
constituents, and a positive cost if they are dissimilar (details are
given in Sec. 5). As the verbal narrations can talk about many other
things than the main steps of a task, we set $c(d, d') = 0$ if either
$d$ or $d'$ is $\emptyset$. An illustration of clustering the
transcribed verbal instructions into a sequence of $K$ steps is shown in
Figure 2.
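The sum-of-pairs cost can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the sum in (1) runs over unordered pairs of sequences, represents a gap as `None`, and replaces the WordNet-based similarity with plain equality of (verb, object) pairs. The function names and the default cost values are chosen for illustration only.

```python
from itertools import combinations

GAP = None  # stands for an empty slot (gap) in the global template


def pair_cost(d1, d2, match_cost=-1.0, mismatch_cost=100.0):
    """Cost c(d1, d2) of putting two direct object relations in one slot.

    Assumption: similarity is approximated by exact equality of the
    (verb, object) pair; a WordNet-based test would replace `d1 == d2`.
    """
    if d1 is GAP or d2 is GAP:
        return 0.0  # aligning anything with a gap is free
    return match_cost if d1 == d2 else mismatch_cost


def sum_of_pairs_cost(aligned, cost=pair_cost):
    """Sum the slot-wise cost over all unordered pairs of sequences.

    `aligned` holds N sequences already re-mapped to the same L slots.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all re-mapped sequences must have L slots")
    total = 0.0
    for seq_n, seq_m in combinations(aligned, 2):
        for d_n, d_m in zip(seq_n, seq_m):
            total += cost(d_n, d_m)
    return total


# Three toy narrations re-mapped to a template of L = 4 slots.
aligned = [
    [("loosen", "nut"), ("jack", "car"), GAP, ("remove", "tire")],
    [("loosen", "nut"), GAP, ("check", "pressure"), ("remove", "tire")],
    [GAP, ("jack", "car"), GAP, ("remove", "tire")],
]
print(sum_of_pairs_cost(aligned))  # -5.0: five matching pairs, no mismatch
```

A lower value indicates a more consistent alignment. In this toy example, moving "check pressure" into a slot occupied by another relation would add the large mismatch cost, which is why it sits in a slot of its own.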

Optimization using Frank-Wolfe. Optimizing the cost (1) is NP-hard [31]
because of the combinatorial nature of the problem. The standard
solution from computational biology is to apply a heuristic algorithm
that proceeds by incremental pairwise alignment using dynamic
programming [17]. In contrast, we show in Appendix B.1 that the multiple
sequence alignment problem given by (1) can be reformulated as an
integer quadratic program with combinatorial constraints, for which the
Frank-Wolfe optimization algorithm has been used recently with
increasing success [4, 13, 14, 15]. Interestingly, we have observed
empirically (see Appendix B.2) that the Frank-Wolfe algorithm gives
better solutions (in terms of objective (1)) than the state-of-the-art
heuristic procedures for this task [12, 17]. Our Frank-Wolfe based
solvers also offer us greater flexibility in defining the alignment cost
and scale better with the length of input sequences and the vocabulary
of direct object relations.

Extracting the main steps. After a global alignment is obtained, we sort
the slots $l$ of the global template by the number of direct object
relations aligned to each slot. Given $K$ as input, the top $K$ slots
give the main instruction steps for the task, unless several slots tied
at the same support would extend the selection beyond $K$. In this case,
we pick the next smaller number below $K$ which excludes these ties,
allowing the choice of an adaptive number of main instruction steps when
there is not enough saliency for the last steps. This strategy
essentially selects $k \le K$ salient steps, while refusing to make a
choice among steps with equal support that would increase the total
number of steps beyond $K$. As we will see in our results in Sec. 5.1,
our algorithm sometimes returns a much smaller number than $K$ for the
main instruction steps, giving more robustness to the exact choice of
parameter $K$.
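The tie-excluding selection rule can be sketched as follows. This is an illustrative reading of the paragraph above rather than the authors' code; the function name `select_main_steps` is hypothetical, and the choice to ignore slots with zero support is an added assumption.

```python
def select_main_steps(support, K):
    """Pick the k <= K most supported template slots, excluding ties.

    `support[l]` is the number of direct object relations aligned to
    slot l. Returns the chosen slot indices in template order.
    """
    # Rank slots by decreasing support; empty slots are ignored
    # (an assumption of this sketch).
    order = sorted(range(len(support)), key=lambda l: support[l], reverse=True)
    order = [l for l in order if support[l] > 0]
    if len(order) <= K:
        chosen = order
    else:
        # Support of the best slot that would be left out of the top K.
        threshold = support[order[K]]
        # Keep only slots strictly better than it, so that no arbitrary
        # choice is made among slots tied at the threshold.
        chosen = [l for l in order[:K] if support[l] > threshold]
    return sorted(chosen)  # restore the temporal order of the template


support = [5, 0, 3, 3, 3, 1]
print(select_main_steps(support, K=3))  # [0]: three slots tie at support 3
print(select_main_steps(support, K=4))  # [0, 2, 3, 4]
```

With `K=3` the three slots tied at support 3 cannot all fit, so the rule drops all of them and returns a single step, which mirrors the adaptive behaviour described in the text.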

Encoding of the output. We post-process the output of multiple sequence
alignment into an assignment matrix $R_n \in \{0,1\}^{S_n \times K}$ for
each input video $n$, where $(R_n)_{sk} = 1$ means that the direct
object token $d^n_s$ has been assigned to step $k$. If a direct object
has not been assigned to any step, the corresponding row of the matrix
$R_n$ will be zero.

4.2. Discriminative clustering of videos under text 
constraints 

Given the output of the text clustering that identified the important
$K$ steps forming a task, we now want to find their temporal location in
the video signal. We formalize this problem as looking for an assignment
matrix $Z_n \in \{0,1\}^{T_n \times K}$ for each input video $n$, where
$(Z_n)_{tk} = 1$ indicates the visual presence of step $k$ at time
interval $t$ in video $n$, and $T_n$ is the length of video $n$.
Similarly to $R_n$, we allow the possibility that a whole row of $Z_n$
is zero, indicating that no step is visually present for the
corresponding time interval.

We propose to tackle this problem using a discriminative 
clustering approach with global ordering constraints, as was 
successfully used in the past for the temporal localization 
of actions in videos [4], but with additional weak temporal 
constraints. In contrast to [4] where the order of actions 
was manually given for each video, our multiple sequence 
alignment approach automatically discovers the main steps. 
More importantly, we also use the text caption timing to 
provide a fine-grained weak temporal supervision for the 
visual appearance of steps, which is described next. 

Temporal weak supervision from text. From the output of the multiple
sequence alignment (encoded in the matrix
$R_n \in \{0,1\}^{S_n \times K}$), each direct object token $d^n_s$ has
been assigned to one of the possible $K$ steps, or to no step at all. We
use the tokens that have been assigned to a step as a constraint on the
visual appearance of the same step in the video (using the assumption
that people do what they say approximately when they say it). We encode
the closed caption timing alignment by a binary matrix
$A_n \in \{0,1\}^{S_n \times T_n}$ for each video, where $(A_n)_{st}$ is
1 if the $s$-th direct object is mentioned in a closed caption that
overlaps with the time interval $t$ in the video. This alignment has two
limitations. First, it is only approximate, as people usually do not
perform the action exactly at the same time that they talk about it, but
instead with a varying delay. Second, the alignment is noisy as people
typically perform the action only once, but often talk about it multiple
times (e.g. in a summary at the beginning of the video). We address
these issues by the following two weak supervision constraints. First,
we consider a larger set of possible time intervals
$[t - \Delta_b, t + \Delta_a]$ in the matrix $A$ rather than the exact
time interval $t$ given by the timing of the closed caption. $\Delta_b$
and $\Delta_a$ are global parameters fixed either qualitatively, or by
cross-validation if labeled data is provided. Second, we impose the
constraint that the action happens at least once in the set of all
possible video time intervals where the action is mentioned in the
transcript (rather than every time it is mentioned). These constraints
can be encoded as the following linear inequality constraint on $Z_n$:
$A_n Z_n \ge R_n$ (see Appendix C.2 for the detailed derivation).
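As a rough illustration of the weak constraint, the Python sketch below builds the widened alignment matrix for one video and checks the inequality $A_n Z_n \ge R_n$ element-wise. It is a simplified reading of the text, not the derivation of Appendix C.2: the helper names are hypothetical, each caption is assumed to cover an inclusive range of time-interval indices, and matrices are plain nested lists.

```python
def build_alignment(caption_intervals, T, delta_b, delta_a):
    """Binary S x T matrix A for one video.

    caption_intervals[s] = (first, last) time-interval indices overlapped
    by the caption mentioning token s (assumed input format). The window
    is widened by delta_b intervals before and delta_a intervals after.
    """
    A = []
    for first, last in caption_intervals:
        lo = max(0, first - delta_b)
        hi = min(T - 1, last + delta_a)
        A.append([1 if lo <= t <= hi else 0 for t in range(T)])
    return A


def satisfies_weak_constraints(A, Z, R):
    """Check A Z >= R element-wise (A: S x T, Z: T x K, R: S x K)."""
    S, T, K = len(A), len(Z), len(Z[0])
    for s in range(S):
        for k in range(K):
            # Number of intervals in the window of token s showing step k.
            hits = sum(A[s][t] * Z[t][k] for t in range(T))
            if hits < R[s][k]:
                return False
    return True


# Toy video: T = 6 intervals, K = 2 steps, S = 2 tokens.
A = build_alignment([(1, 1), (4, 4)], T=6, delta_b=0, delta_a=1)
R = [[1, 0], [0, 1]]  # token 0 -> step 0, token 1 -> step 1
Z_ok = [[0, 0], [0, 0], [1, 0], [0, 0], [0, 0], [0, 1]]
Z_bad = [[1, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 1]]
print(satisfies_weak_constraints(A, Z_ok, R))   # True
print(satisfies_weak_constraints(A, Z_bad, R))  # False
```

In `Z_bad`, step 0 is placed before the widened window of the token that mentions it, so the row of $A_n Z_n$ for that token is zero and the constraint fails.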

Ordering constraint. In addition, we also enforce that the temporal
order of the steps appearing visually is consistent with the discovered
script from the text, encoding our assumption that there is a common
ordered script for the task across videos. We encode these sequence
constraints on $Z_n$ in a similar manner to [5], which was shown to work
better than the encoding used in [4]. In particular, we only predict the
most salient time interval in the video that describes a given step.
This means that a particular step is assigned to exactly one time
interval in each video. We denote by $\mathcal{Z}_n$ this sequence
ordering constraint set.
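Membership in the ordering constraint set can be checked with a few lines of Python. The sketch below is one plausible reading of the description above (each step on exactly one interval, at most one step per interval, steps in script order); the function name is hypothetical and the exact encoding used in [5] may differ.

```python
def satisfies_ordering(Z):
    """Check that a T x K binary matrix Z respects the ordered script.

    Conditions assumed in this sketch:
      * every time interval (row) shows at most one step,
      * every step (column) is active on exactly one interval,
      * step k occurs strictly before step k + 1.
    """
    T, K = len(Z), len(Z[0])
    if any(sum(row) > 1 for row in Z):
        return False
    times = []
    for k in range(K):
        hits = [t for t in range(T) if Z[t][k] == 1]
        if len(hits) != 1:
            return False
        times.append(hits[0])
    return all(a < b for a, b in zip(times, times[1:]))


Z_ordered = [[0, 0], [1, 0], [0, 0], [0, 1]]   # step 0 at t=1, step 1 at t=3
Z_swapped = [[0, 1], [0, 0], [1, 0], [0, 0]]   # step 1 before step 0
print(satisfies_ordering(Z_ordered))  # True
print(satisfies_ordering(Z_swapped))  # False
```

Because a valid assignment is just an increasing list of K time indices, the best assignment for a linear score can be found by dynamic programming, which is consistent with the efficient solvers mentioned in the text.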

Discriminative clustering. The main motivation behind discriminative
clustering is to find a clustering of the data that can be easily
recovered by a linear classifier through the minimization of an
appropriate cost function over the assignment matrix $Z_n$. The approach
introduced in [2] makes it easy to add prior information on the expected
clustering. Such priors have been recently introduced in the context of
aligning video and text [4, 5] in the form of ordering constraints over
the latent label variables. Here we use a similar approach to cluster
the $N$ input video streams $(x_t)$ into a sequence of $K$ steps, as
follows. We represent each time interval by a $d$-dimensional feature
vector. The feature vectors for the $n$-th video are stacked in a
$T_n \times d$ design matrix denoted by $X_n$. We denote by $X$ the
$T \times d$ matrix obtained by the concatenation of all $X_n$ matrices,
with $T = \sum_n T_n$ (and similarly, by $Z$, $R$ and $A$ the
appropriate concatenation of the $Z_n$, $R_n$ and $A_n$ matrices over
$n$). In order to obtain the temporal localization into $K$ steps, we
learn a linear classifier represented by a $d \times K$ matrix denoted
by $W$. This model is shared among all videos.

Precision   0.9   0.4    1      0.67   0.83
Recall      0.9   0.57   0.86   0.6    0.42

Table 1: Automatically recovered sequences of steps for the five tasks.
Each recovered step is represented by one of the aligned direct object
relations (shown in bold). Note that most of the recovered steps
correspond well to the ground truth steps (shown in italic). The results
are shown for the maximum number of discovered steps $K$ set to 10. Note
how our method automatically selects fewer than 10 steps in some cases.
These are the automatically chosen $k \le K$ steps that are the most
salient in the aligned narrations as described in Sec. 4.1. For CPR, our
method recovers fine-grained steps, e.g. tilt head, lift chin, which are
not included in the main ground truth steps, but nevertheless could be
helpful in some situations, as well as repetitions that were not
annotated but were indeed present.


The target assignment $Z$ is found by minimizing the clustering cost
function $h$ under both the consistent script ordering constraints
$\mathcal{Z}$ and our weak supervision constraints:

$$\underset{Z}{\text{minimize}} \;\; h(Z) \quad \text{s.t.} \quad \underbrace{Z \in \mathcal{Z}}_{\text{ordered script}}, \quad \underbrace{AZ \ge R}_{\text{weak textual constraints}}. \qquad (2)$$

The clustering cost $h(Z)$ is given as in DIFFRAC [2] as:

$$h(Z) = \min_{W \in \mathbb{R}^{d \times K}} \; \underbrace{\frac{1}{2T} \|XW - Z\|_F^2}_{\text{Discriminative loss on data}} + \underbrace{\frac{\lambda}{2} \|W\|_F^2}_{\text{Regularizer}}. \qquad (3)$$

The first term in (3) is the discriminative loss on the data that
measures how easily the input data $X$ can be separated by the linear
classifier $W$ when the target classes are given by the assignments $Z$.
For the squared loss considered in eq. (3), the optimal weights $W^*$
minimizing (3) can be found in closed form, which significantly
simplifies the computation. However, to solve (2), we need to optimize
over assignment matrices $Z$ that encode sequences of events and
incorporate constraints given by clusters obtained from transcribed
textual narrations (Sec. 4.1). This is again done by using the
Frank-Wolfe algorithm, which allows the use of efficient dynamic
programs to handle the combinatorial constraints on $Z$. More details
are given in Appendix C.
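To make the closed-form step concrete, the short derivation below assumes that the cost takes the ridge-regression form $h(Z) = \min_W \frac{1}{2T}\|XW - Z\|_F^2 + \frac{\lambda}{2}\|W\|_F^2$, with $\lambda > 0$ a regularization weight. The scaling constants and the symbol $B$ are assumptions of this sketch and may differ from the normalization used in Appendix C.

```latex
% Setting the gradient with respect to W to zero:
\frac{1}{T} X^\top (XW - Z) + \lambda W = 0
\;\Longrightarrow\;
W^* = \left( X^\top X + T\lambda I_d \right)^{-1} X^\top Z .

% Substituting W^* back leaves a quadratic function of Z alone:
h(Z) = \frac{1}{2T} \operatorname{Tr}\!\left( Z^\top B Z \right),
\qquad
B = I_T - X \left( X^\top X + T\lambda I_d \right)^{-1} X^\top .
```

Under these assumptions $B$ is positive semidefinite, so $h$ is a convex quadratic in $Z$ and its gradient $\frac{1}{T} B Z$ is cheap to evaluate, which fits the use of a Frank-Wolfe scheme whose inner step is a linear problem over the constraint set.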

5. Experimental evaluation 

In this section, we first describe the details of the text 
and video features. Then we present the results divided into 
two experiments: (i) in Sec. 5.1, we evaluate the quality of 
steps extracted from video narrations, and (ii) in Sec. 5.2, 
we evaluate the temporal localization of the recovered steps 
in video using constraints derived from text. All the data 
and code are available at our project webpage [1]. 


Video and text features. We represent the transcribed narrations as
sequences of direct object relations. For this purpose, we run a
dependency parser [8] on each transcript. We lemmatize all direct object
relations and keep the ones for which the direct object corresponds to
nouns. To represent a video, we use motion descriptors in order to
capture actions (loosening, jacking-up, giving compressions) and frame
appearance descriptors to capture the depicted objects (tire, jack,
car). We split each video into 10-frame time intervals and represent
each interval by its motion and appearance descriptors aggregated over a
longer block of 30 frames. The motion representation is a histogram of
local optical flow (HOF) descriptors aggregated into a single
bag-of-visual-word vector of 2,000 dimensions [30]. The visual
vocabulary is generated by k-means on a separate large set of training
descriptors. To capture the depicted objects in the video, we apply the
VGG-verydeep-16 CNN [28] over each frame in a sliding window manner over
multiple scales. This can be done efficiently in a fully convolutional
manner. The resulting 512-dimensional feature maps of conv5 responses
are then aggregated into a single bag-of-visual-word vector of 1,000
dimensions, which aims to capture the presence/absence of different
objects within each video block. A similar representation (aggregated
into compact VLAD descriptor) was shown to work well recently for a
variety of recognition tasks [7]. The bag-of-visual-word vectors
representing the motion and the appearance are normalized using the
Hellinger normalization and then concatenated into a single
3,000-dimensional vector representing each time interval.
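The final normalization and concatenation step can be sketched in a few lines of Python. The sketch assumes the common definition of Hellinger normalization (L1-normalize a non-negative histogram, then take element-wise square roots); the function names are hypothetical and the toy histograms are far smaller than the 2,000 and 1,000 dimensions used above.

```python
import math


def hellinger_normalize(hist):
    """L1-normalize a non-negative histogram, then take square roots.

    The result has unit L2 norm whenever the histogram is non-empty.
    """
    total = sum(hist)
    if total == 0:
        return [0.0] * len(hist)
    return [math.sqrt(v / total) for v in hist]


def interval_descriptor(motion_bow, appearance_bow):
    """Concatenate the normalized motion and appearance histograms."""
    return hellinger_normalize(motion_bow) + hellinger_normalize(appearance_bow)


motion = [1, 3]          # toy bag-of-visual-words counts for HOF
appearance = [4, 0, 0]   # toy bag-of-visual-words counts for CNN features
x = interval_descriptor(motion, appearance)
print(len(x))  # 5
```

Normalizing each modality separately before concatenation keeps the motion and appearance parts on the same scale, so neither dominates the squared loss of the linear classifier.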

WordNet distance. For the multiple sequence alignment presented in
Sec. 4.1, we set $c(d_1, d_2) = -1$ if $d_1$ and $d_2$ have both their
verbs and direct objects that match exactly in the WordNet tree
(distance equal to 0). Otherwise we set $c(d_1, d_2)$ to be 100. This is
to ensure a high precision for the resulting alignment.

[Figure 3 panels: (a) Change tire (11), (b) Perform CPR (7), (c) Repot
plant (7), (d) Make coffee (10), (e) Jump car (12)]

Figure 3: Results for temporally localizing recovered steps in the input
videos. We give in bold the number of ground truth steps (the number in
parentheses after each task name).


5.1. Results of step discovery from text narrations 

Results of discovering the main steps for each task from
text narrations are presented in Table 1. We report results of
the multiple sequence alignment described in Sec. 4.1 when
the maximum number of recoverable steps is K = 10.
Additional results for different choices of K are given in
Appendix E.1. With increasing K, we tend to recover more
complete sequences at the cost of occasional repetitions,
e.g. position jack and jack car, which refer to the same step.
To quantify the performance, we measure precision as the
proportion of correctly recovered steps appearing in the
correct order. We also measure recall as the proportion of
ground truth steps that are recovered. The values of precision
and recall are given at the bottom of Table 1.

5.2. Results of localizing instruction steps in video 

In the previous section, we have evaluated the quality of 
the sequences of steps recovered from the transcribed nar¬ 
rations. In this section, we evaluate how well we localize 
the individual instruction steps in the video by running our 
two-stage approach from Sec. 4. 

Evaluation metric. To evaluate the temporal localization,
we need a one-to-one mapping between the discovered steps
in the videos and the ground truth steps. Following [18], we
look for a one-to-one global matching (shared across all
videos of a given task) that maximizes the evaluation score
for a given method (using the Hungarian algorithm). Note
that this mapping is used only for evaluation; the algorithm
does not have access to the ground truth annotations for
learning.
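
The matching step can be illustrated on a toy score table. The sketch below is a brute-force stand-in for the Hungarian algorithm (it enumerates all injective mappings, which is only practical for a handful of steps); the function name and the score values are hypothetical. For realistic sizes, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would be the natural choice.

```python
from itertools import permutations

def best_one_to_one(score):
    """
    score[i][j]: evaluation score when discovered step i is matched to
                 ground truth step j.
    Returns (best_total, mapping) with mapping[i] = matched ground truth index.
    Assumes there are no more discovered steps than ground truth steps.
    """
    n_disc, n_gt = len(score), len(score[0])
    best_total, best_map = float("-inf"), None
    # Enumerate every injective assignment of discovered steps to ground truth steps.
    for perm in permutations(range(n_gt), n_disc):
        total = sum(score[i][perm[i]] for i in range(n_disc))
        if total > best_total:
            best_total, best_map = total, list(perm)
    return best_total, best_map

# Two discovered steps, three ground truth steps (toy values).
score = [
    [5, 1, 0],
    [4, 3, 0],
]
total, mapping = best_one_to_one(score)
```

Note that the greedy choice is not always optimal in general, which is why a global matching is used rather than matching each discovered step independently.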

The goal is to evaluate whether each ground truth step
has been correctly localized in all instruction videos. We
thus use the F1 score, which combines precision and recall
into a single score, as our evaluation measure. For a given
video and a given recovered step, our video clustering method
predicts exactly one video time interval $t$. This detection
is considered correct if the time interval falls inside any
of the corresponding ground truth intervals, and incorrect
otherwise (resulting in a false positive for this video). We
compute the recall across all steps and videos, defined as
the ratio of the number of correct predictions over the total
number of possible ground truth steps across videos. A
recall of 1 indicates that every ground truth step has been
correctly detected across all videos. The recall decreases
towards 0 when we miss some ground truth steps (missed
detections). A step is missed either because it was not
recovered globally, or because it was detected in the video at
an incorrect location, since the algorithm predicts exactly
one occurrence of each step in each video. Similarly,
precision measures the proportion of correct predictions
among all $N \cdot K_{\text{pred}}$ possible predictions, where $N$ is
the number of videos and $K_{\text{pred}}$ is the number of main steps
used by the method. The F1 score is the harmonic mean of
precision and recall, giving a score that ranges between 0
and 1, with the perfect score of 1 when all the steps are
predicted at their correct locations in all videos.
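
The metric can be made concrete with a short sketch. The data layout (dictionaries keyed by step) and the function name are hypothetical choices made for illustration, and the recall denominator is taken here to be the number of annotated ground truth steps summed over videos, which is one reading of "possible ground truth steps".

```python
def f1_localization(predictions, ground_truth, mapping):
    """
    predictions[n][k]  : predicted time for recovered step k in video n
    ground_truth[n][g] : list of (start, end) intervals of ground truth step g in video n
    mapping[k]         : ground truth step matched to recovered step k
    """
    correct = 0
    n_pred = 0  # equals N * K_pred when every step is predicted in every video
    n_gt = sum(len(steps) for steps in ground_truth)  # annotated steps over all videos
    for preds, steps in zip(predictions, ground_truth):
        for k, t in preds.items():
            n_pred += 1
            intervals = steps.get(mapping.get(k), [])
            if any(start <= t <= end for start, end in intervals):
                correct += 1
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gt if n_gt else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

predictions = [{0: 12.0, 1: 40.0}, {0: 5.0, 1: 90.0}]
ground_truth = [
    {"loosen": [(10, 20)], "jack": [(35, 50)], "remove": [(60, 70)]},
    {"loosen": [(8, 15)], "jack": [(80, 100)]},
]
mapping = {0: "loosen", 1: "jack"}
precision, recall, f1 = f1_localization(predictions, ground_truth, mapping)
```

In this toy example three of the four predictions are correct and five ground truth steps are annotated, so precision is 3/4 and recall is 3/5.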

Hyperparameters. We set the values of parameters $\Delta_b$
and $\Delta_a$ to 0 and 10 seconds. The setting is the same for all
five tasks. This models the fact that typically each step is
first described verbally and then performed on camera.
We set $\lambda = 1/(N K_{\text{pred}})$ for all methods that use (3).

Baselines. We compare results to four baselines. To
demonstrate the difficulty of our dataset, we first evaluate
a “Uniform” baseline, which simply distributes instruction
steps uniformly over the entire instruction video. The second
baseline “Video only” [4] does not use the narration and
performs only discriminative clustering on visual features
with a global order constraint.⁵ The third baseline “Video +
BOW dobj” adds text-based features to the “Video only”
baseline (by concatenating the text and video features
in the discriminative clustering approach). Here the goal
is to evaluate the benefits of our two-stage clustering
approach, in contrast to this single-stage clustering baseline.
The text features are bag-of-words histograms over a fixed
vocabulary of direct object relations.⁶ The fourth baseline
is our own implementation of the alignment method of [19]
(without the supervised vision refinement procedure that
requires a set of pre-trained visual classifiers that are not
available a priori in our case). We use [19] to re-align the
speech transcripts to the sequence of steps discovered by
our method of Sec. 4.1 (as a proxy for the recipe assumed

5 We use here the improved model from [5] which does not require 
a “background class” and yields a stronger baseline equivalent to our 
model (2) without the weak textual constraints. 

6 Alternative features of bag-of-words histograms treating separately 
nouns and verbs also give similar results. 



























































to be known in [19]).⁷ To assess the difficulty of the task
and dataset, we also compare results with a “Supervised”
approach. The classifiers $W$ for the visual steps are trained
by running the discriminative clustering of Sec. 4.2 with
only ground truth annotations as constraints on the training
set. At test time, these classifiers are used to make
predictions under the global ordering constraint on unseen
videos. We report results using 5-fold cross validation for
the supervised approach, with the variation across folds
giving the error bars. For the unsupervised discriminative
clustering methods, the error bars represent the variation of
performance obtained from different rounded solutions
collected during the Frank-Wolfe optimization.

Results. Results for localizing the discovered instruction
steps are shown in Figure 3. In order to perform a fair
comparison to the baseline methods that require a known
number of steps K, we report results for a range of K values.
Note that in our case the actual number of automatically
recovered steps can be (and often is) smaller than K. For
Change tire and Perform CPR, our method consistently
outperforms all baselines for all values of K, demonstrating
the benefits of our approach. For Repot, our method is
comparable to text-based baselines, underlining the importance
of the text signal for this problem. For Jump car, our method
delivers the best result (for K = 15) but struggles for lower
values of K, which we found was due to visually similar
repeating steps (e.g. start car A and start car B) that are
mixed up for lower values of K. For the Make coffee task,
the video only baseline is comparable to our method;
inspecting the output suggests that this could be attributed
to the large variability of narrations for this task.
Qualitative results of the recovered steps are illustrated in
Figure 4.

6. Conclusion and future work 

We have described a method to automatically discover
the main steps of a task from a set of narrated instruction
videos in an unsupervised manner. The proposed approach
has been tested on a new annotated dataset of challenging
real-world instruction videos containing complex
person-object interactions in a variety of indoor and outdoor
scenes. Our work opens up the possibility for large scale
learning from instruction videos on the Internet. Our model
currently assumes the existence of a common script with a
fixed ordering of the main steps. While this assumption is
often true, e.g. one cannot remove the wheel before jacking up
the car, or make coffee before filling the water, some tasks
can be performed while swapping (or even leaving out) some of
the steps. Recovering more complex temporal structures is
an interesting direction for future work.

Figure 4: Examples of three recovered instruction steps for each of
the five tasks in our dataset. For each step, we first show clustered direct
object relations, followed by representative example frames localizing the
step in the videos. Correct localizations are shown in green. Some steps
are incorrectly localized in some videos (red), but often look visually very
similar. See Appendix E.2 for additional results.

Outline of Supplementary Material

This supplementary material provides additional details for our
method and presents a more complete set of results. Section A
gives detailed statistics and an illustration of the newly collected
dataset of instruction videos. Section B gives details about our new
formulation of the multiple sequence alignment problem
(Section 4.1 of the main paper) as a quadratic program and presents
empirical results showing that our Frank-Wolfe optimization
approach obtains solutions with lower objective values than the
state-of-the-art heuristic algorithms for multiple sequence alignment.
Section C provides the details for the discriminative clustering
of videos with text constraints that was briefly described in
Section 4.2 of the main paper. Section D gives additional details
about the experimental protocol used in Section 5.2 in the main
paper. Finally, in Section E, we give a more complete set of
qualitative results for both the clustering of transcribed verbal
instructions (see E.1) and localizing instruction steps in video (see E.2).

A. New challenging dataset of instruction 
videos 

A.1. Dataset statistics

In this section, we introduce three different scores which aim 
to illustrate different properties of our dataset. The scores charac¬ 
terize (i) the step ordering consistency, (ii) the missing steps and 
(iii) the possible step repetitions. 

Let $N$ be the number of videos for a given task and $K$ the
number of steps defined in the ground truth. We assume that the
ground truth steps are given in an ordered fashion, meaning the
global order is defined as the sequence $\{1, \ldots, K\}$. For the
$n$-th video, we denote by $g_n$ the total number of annotated steps,
by $u_n$ the number of unique annotated steps and finally by $l_n$ the
length of the longest common subsequence between the annotated
sequence of steps and the ground truth sequence $\{1, \ldots, K\}$.


Order consistency error. The order error score $O$ is defined
as the proportion of non-repeated annotated steps that are not
consistent with the global ordering. In other words, it is defined as
the number of steps that do not fit the global ordering defined in
the ground truth divided by the total number of unique annotated
steps. More formally, $O$ is defined as follows:

$$ O := 1 - \frac{\sum_{n=1}^{N} l_n}{\sum_{n=1}^{N} u_n}. \tag{4} $$


Missing steps. We define the missing steps score $M$ as the
proportion of steps that are visually missing in the videos when
compared to the ground truth. Formally,

$$ M := 1 - \frac{\sum_{n=1}^{N} u_n}{KN}. \tag{5} $$


Repeated steps. The repetition score $R$ is defined as the
proportion of steps that are repeated:

$$ R := 1 - \frac{\sum_{n=1}^{N} u_n}{\sum_{n=1}^{N} g_n}. \tag{6} $$
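
The three scores can be computed directly from the annotated step sequences. The following sketch uses a textbook longest common subsequence recursion; the function names and the toy annotations are hypothetical and only serve to illustrate definitions (4) to (6).

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def dataset_scores(annotations, K):
    """annotations: one list of annotated step indices (in 1..K) per video."""
    script = list(range(1, K + 1))              # global order {1, ..., K}
    g = [len(seq) for seq in annotations]       # g_n: annotated steps
    u = [len(set(seq)) for seq in annotations]  # u_n: unique annotated steps
    l = [lcs_length(seq, script) for seq in annotations]  # l_n
    N = len(annotations)
    order_error = 1 - sum(l) / sum(u)
    missing = 1 - sum(u) / (K * N)
    repetition = 1 - sum(u) / sum(g)
    return order_error, missing, repetition

# Three toy videos for a task with K = 4 ground truth steps:
# one in order, one with two steps swapped, one with a repetition and a missing step.
annotations = [[1, 2, 3, 4], [1, 3, 2, 4], [1, 2, 2, 4]]
O, M, R = dataset_scores(annotations, K=4)
```

For these toy annotations the sums are 10 for $l_n$, 11 for $u_n$ and 12 for $g_n$, giving $O = 1/11$ and $M = R = 1/12$.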


Results. In Table 2, we give the previously defined statistics
for the five tasks of the instruction videos dataset. Interestingly,
we observed that globally the order is consistent for the five tasks,
with a total order error of only 6%. Steps are missing in 27% of
the cases. This illustrates the difficulty of defining the right
granularity of the ground truth for this task. Indeed, some steps might
be optional and thus not visually demonstrated in all videos.
Finally, the global repetition score is 14%. Looking more closely,
we observe that the Performing CPR task is the main contributor
to this score. This is a clear example of a task where the same steps
must be repeated several times (here alternating between
compressions and giving breath). Even though our model does not
explicitly handle this case, we observed that our multiple sequence
alignment technique for clustering the text inputs discovered these
repetitions (see Table 4). Finally, these statistics show that the
problem introduced in this paper is very challenging and that
designing models which are able to capture more complex structure
in the organization of the steps is a promising direction for future
work.

A.2. Complete illustration of the dataset

Figure 6 illustrates all five tasks in our newly collected dataset.
For each task, we show a subset of 3 events that compose the task.
Each event is represented by several sample frames and extracted
verbal narrations. Note the large variability of verbal expressions
and the terminology in the transcribed narrations as well as the
large variability of visual appearance due to viewpoint, used
objects, and actions performed in different manners. At the same time,
note the consistency of the actions between the different videos
and the underlying script of each task.

B. Clustering transcribed verbal instructions 

In this section, we review in detail the way we model the text
clustering. In particular, we give details on how we can
reformulate multiple sequence alignment as a quadratic program.
Recall that we are given $N$ narrated instruction videos. For the $n$-th
video, the text signal is represented as a sequence of direct object
relation tokens: $d_n = (d_1^n, \ldots, d_{S_n}^n)$, where the length $S_n$ of
the sequences varies from one video clip to another. The number
of possible direct object relations in our dictionary is denoted $D$.
The multiple sequence alignment (MSA) problem was formulated
as mapping each input sequence $d_n$ of tokens to a global common
template of $L$ slots, while minimizing the sum-of-pairs score
given in (1). For each input sequence $d_n$, we used the notation
$(\phi(d_n))_{1 \le l \le L}$ to denote the re-mapped sequence of tokens into $L$
slots: $\phi(d_n)_l$ represents the direct object relation put at location $l$,
with $\phi(d_n)_l = \emptyset$ denoting that a gap was inserted in the original
sequence and the slot $l$ is left empty. We also have defined a cost
$c(d_1, d_2)$ of aligning two direct object relations together, with the
possibility that $d_1$ or $d_2$ is $\emptyset$, in which case we defined the cost to
be 0 by default. In the following, we summarize the cost of
aligning non-empty direct object relations by the matrix $C_o \in \mathbb{R}^{D \times D}$:
$(C_o)_{ij}$ is equal to the cost of aligning the $i$-th and the $j$-th direct
object relation from the dictionary together.
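
To make the objective concrete, the sketch below evaluates a sum-of-pairs score for a given set of re-mapped sequences. It assumes that the score sums the slot-wise costs over all unordered pairs of sequences, and it uses plain string equality as a stand-in for the exact WordNet match; the names `GAP`, `pair_cost` and `sum_of_pairs` are hypothetical.

```python
from itertools import combinations

GAP = None  # stands for the empty slot

def pair_cost(d1, d2, match_cost=-1, mismatch_cost=100):
    """Cost of aligning two slots: gaps cost 0, exact matches are rewarded."""
    if d1 is GAP or d2 is GAP:
        return 0
    return match_cost if d1 == d2 else mismatch_cost

def sum_of_pairs(aligned):
    """aligned: re-mapped sequences, all of length L, with GAP for empty slots."""
    total = 0
    for a, b in combinations(aligned, 2):
        total += sum(pair_cost(x, y) for x, y in zip(a, b))
    return total

# Three toy narrations re-mapped onto a template of L = 4 slots.
aligned = [
    ["loosen nut", "jack car", GAP, "remove tire"],
    ["loosen nut", GAP, "remove nut", "remove tire"],
    [GAP, "jack car", "remove nut", "remove tire"],
]
cost = sum_of_pairs(aligned)
```

Each of the three pairs of sequences shares two matching slots, so the total cost is -6; a single mismatched pair would add 100, which is why low-cost alignments tend to be very precise.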

B.1. Reformulating multiple sequence alignment as
a quadratic program 

We now present our formalization of the search problem as a 
quadratic program. To the best of our knowledge this is a new 





| Task | Changing tire | Performing CPR | Repotting plant | Making coffee | Jumping cars | Average |
|---|---|---|---|---|---|---|
| Order error | 0.7% | 11% | 6% | 3% | 8% | 6% |
| Missing steps | 16% | 32% | 30% | 28% | 27% | 27% |
| Repetition score | 4% | 50% | 7% | 11% | 0.4% | 14% |

Table 2: Statistics of the instruction video dataset.


formulation of the multiple sequence alignment (MSA) problem,
which in our setting (results shown later) consistently obtains
better values of the multiple sequence alignment objective than the
current state-of-the-art MSA heuristic algorithms.

We encode the identity of a direct object relation with a
$D$-dimensional indicator vector. The text sequence $n$ can then be
represented by an indicator matrix $Y_n \in \{0,1\}^{S_n \times D}$. The $j$-th
row of $Y_n$ indicates which direct object relation is evoked at the
$j$-th position. Similarly, the token re-mapping $(\phi(d_n))_{1 \le l \le L}$ can be
represented as an $L \times D$ indicator matrix, where each row $l$ encodes
which token appears in slot $l$ (and a whole row of zeros is used
to indicate an empty slot $\emptyset$). This re-mapping can be constructed
from two pieces of information. The first is which token index $s$ of the
original sequence is re-mapped to which global template slot $l$; we
represent this by the decision matrix $U_n \in \{0,1\}^{S_n \times L}$, which
satisfies very specific constraints (see below). The second piece
of information is the composition of the input sequence encoded
by $Y_n$. We thus have $\phi(d_n) = U_n^T Y_n$ (as an $L \times D$ indicator
matrix). Given this encoding, the cost matrix $C_o$, and the fact that
the alignment of empty slots has zero cost, we can then rewrite
the MSA problem that minimizes the sum-of-pairs objective (1) as
follows:

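$$
\begin{aligned}
\underset{U_n,\; n \in \{1,\ldots,N\}}{\text{minimize}} \quad & \sum_{(n,m)} \operatorname{Tr}\left(U_n^T Y_n C_o Y_m^T U_m\right) \\
\text{subject to} \quad & U_n \in \mathcal{U}_n, \quad n = 1, \ldots, N.
\end{aligned}
\tag{7}
$$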

In the above equation, the trace (Tr) computes the cost of
aligning sequence $m$ with sequence $n$ (the inner sum in (1)). Moreover,
$\mathcal{U}_n$ is a constraint set that encodes the fact that $U_n$ has to be a valid
(increasing) re-mapping.⁸ As before, we can eliminate the video
index $n$ by simply stacking the assignment matrices $U_n$ in one
matrix $U$ of size $S \times L$. Similarly, we denote by $Y$ the $S \times D$ matrix
obtained by concatenating all the $Y_n$ matrices. We
can then rewrite equation (7) as a quadratic program over the
(integer) variable $U$:

$$ \text{minimize} \;\; \operatorname{Tr}\left(U^T B U\right), \quad \text{subject to} \;\; U \in \mathcal{U}. \tag{8} $$

In this equation, the $S \times S$ matrix $B$ is deduced from the input
sequences and the cost between different direct object relations by
computing $B := Y C_o Y^T$. It represents the pairwise cost at the
token level, i.e. the cost of aligning token $s$ in one sequence to
token $s'$ in another sequence.
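
The stacked trace can be evaluated without forming any matrix explicitly, which also clarifies what it counts. The sketch below is an illustration under stated assumptions: the function name and the toy data are hypothetical, tokens of the same sequence are assumed to occupy distinct slots, and the trace is taken over all pairs of stacked tokens, including a token with itself.

```python
def trace_objective(sequences, slots, vocab, cost):
    """
    sequences: list of token lists (one per video)
    slots:     slots[n][s] = template slot assigned to token s of sequence n
    vocab:     list of the D distinct tokens (the dictionary)
    cost:      cost[i][j] = cost of aligning vocab[i] with vocab[j]
    Returns Tr(U^T B U) with B = Y C Y^T, computed entry by entry.
    """
    index = {tok: i for i, tok in enumerate(vocab)}
    # Each stacked token is described by its dictionary index (its row of Y)
    # and its assigned slot (its row of U).
    rows = [(index[tok], l)
            for seq, seq_slots in zip(sequences, slots)
            for tok, l in zip(seq, seq_slots)]
    total = 0
    for i, l in rows:
        for j, l2 in rows:
            if l == l2:              # (U U^T) entry is 1 iff both tokens share a slot
                total += cost[i][j]  # corresponding entry of B
    return total

vocab = ["loosen nut", "jack car", "remove nut", "remove tire"]
cost = [[-1 if i == j else 100 for j in range(4)] for i in range(4)]
sequences = [
    ["loosen nut", "jack car", "remove tire"],
    ["loosen nut", "remove nut", "remove tire"],
    ["jack car", "remove nut", "remove tire"],
]
slots = [[0, 1, 3], [0, 2, 3], [1, 2, 3]]
value = trace_objective(sequences, slots, vocab, cost)
```

In this toy example the trace equals -21: each unordered pair of aligned tokens from different sequences is counted twice (2 x -6), and each of the 9 tokens contributes its self-alignment cost of -1. The latter term does not depend on the assignment, so under these assumptions minimizing the trace and minimizing the sum over pairs select the same alignments.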

B.2. Comparison of methods 

The problem (8) is NP-hard [31] in general, as is typical for 
integer quadratic programs. However, much work has been done 
in computational biology to develop efficient heuristics to solve 

8 More formally $\mathcal{U}_n := \{U \in \{0,1\}^{S_n \times L} \text{ s.t. } U 1_L = 1_{S_n} \text{ and } \forall l,\ (U_{sl} = 1 \Rightarrow (\forall s' > s,\ l' \le l),\ U_{s'l'} = 0)\}$.


the MSA problem, as it is an important problem in that field. We
briefly describe below some of the existing heuristics to solve it,
and then present our Frank-Wolfe optimization approach, which
gave surprisingly good empirical results for our problem.⁹

Standard methods. Here, we compare to a standard
state-of-the-art method for multiple sequence alignment [17]. Similarly
to [12], this method first aligns two sequences and merges them into a
common template. It then aligns a new sequence to the template and
updates the template, and continues in this way until no sequence
is left. Differently from [12], it uses a better representation of the
template, based on a partial order graph instead of a simple linear
representation. This gives more accuracy for the final alignment. For
the experiments, we use the author’s implementation.¹⁰

Our solution using Frank-Wolfe optimization. We first
note that problem (8) has a very similar structure to an
optimization problem that we solve using Frank-Wolfe optimization for the
discriminative clustering of videos; see Equations (12) and (13)
below. For this, we first perform a continuous relaxation of the
set of constraints $\mathcal{U}$ by replacing it with its convex hull $\operatorname{conv}(\mathcal{U})$. The
Frank-Wolfe optimization algorithm [13] can solve quadratic
programs over constraint sets for which we have access to an efficient
linear minimization oracle. In the case of $\operatorname{conv}(\mathcal{U})$, the linear oracle can
be solved exactly with a dynamic program very similar to the one
described in Section C.2. We note here that even with the
continuous relaxation over $\operatorname{conv}(\mathcal{U})$, the resulting problem is still non-convex
because $B$ is not positive semidefinite; this is because of the cost
function appearing in the MSA problem. However, the standard
convergence proof for Frank-Wolfe can easily be extended to show
that it converges at a rate of $O(1/\sqrt{k})$ to a stationary point on
non-convex objectives [33]. Once the algorithm has converged to
a (local) stationary point, we need to round the fractional solution
to obtain a valid encoding $U$. We follow here a rounding strategy
similar to the one originally proposed by [32] and then re-used
in [14]: we pick the last visited corner (which is necessarily
integer) which was given as a solution to the linear minimization
oracle (this is called Frank-Wolfe rounding).
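
The structure of the algorithm (gradient, linear minimization oracle, convex update, last visited corner) can be illustrated on a much simpler problem. The sketch below is not our MSA solver: it minimizes a toy convex quadratic over the probability simplex, where the oracle reduces to picking the smallest gradient coordinate, and it uses the standard $2/(k+2)$ step size. All names and values are hypothetical.

```python
def quad(B, x):
    """Objective x^T B x."""
    d = len(B)
    return sum(x[i] * B[i][j] * x[j] for i in range(d) for j in range(d))

def frank_wolfe_simplex(B, n_iters=2000):
    """Minimize x^T B x over the probability simplex (convex hull of the unit vectors)."""
    d = len(B)
    x = [1.0 / d] * d
    last_corner = None
    for k in range(n_iters):
        # Gradient of x^T B x for a symmetric B is 2 B x.
        grad = [2 * sum(B[i][j] * x[j] for j in range(d)) for i in range(d)]
        # Linear minimization oracle: the best corner of the simplex is the
        # unit vector at the smallest gradient coordinate.
        i_star = min(range(d), key=lambda i: grad[i])
        corner = [1.0 if i == i_star else 0.0 for i in range(d)]
        gamma = 2.0 / (k + 2)  # standard Frank-Wolfe step size
        x = [(1 - gamma) * xi + gamma * ci for xi, ci in zip(x, corner)]
        last_corner = corner   # kept for "Frank-Wolfe rounding"
    return x, last_corner

B = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 4.0]]
x, corner = frank_wolfe_simplex(B)
value = quad(B, x)
```

For this toy problem the optimum is $x \propto (1, 1/2, 1/4)$ with value $4/7$. The returned corner is always an integer point of the constraint set, which is what makes it usable as a rounded solution; in the MSA setting the oracle would be the dynamic program mentioned above rather than a coordinate minimum.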

Results. In Table 3, we give the value of the objective (8) for 
the rounded solutions obtained by the two different optimization 
approaches (lower is better), for the MSA problem on our five 
tasks. Interestingly, we observe that the Frank-Wolfe algorithm 

9 We stress here that we do not claim that our formulation of the
multiple sequence alignment (MSA) problem as a quadratic program
outperforms the state-of-the-art computational biology heuristics for the MSA
problems arising in biology. We only report our observations on applying
multiple sequence alignment to our setting, which might have a
structure for which these heuristics are not as appropriate.

10 Code available at http://sourceforge.net/projects/poamsa/.







| Task | Changing tire | Performing CPR | Repotting plant | Making coffee | Jumping cars |
|---|---|---|---|---|---|
| Poa [17] | 11.30 | -3.82 | 1.65 | -2.99 | 4.55 |
| Ours using Frank-Wolfe | -5.18 | -4.51 | -3.55 | -3.86 | -4.67 |

Table 3: Comparison of different optimization approaches for solving problem (8). (Objective value, lower is better).


consistently outperforms the state-of-the-art method of [17] in our 
setting. 

C. Discriminative clustering of videos under 
text constraints 

We give more details here on the discriminative clustering 
framework from [4, 5] (and our modifications to include the text 
constraints) that we use to localize the main actions in the video 
signal. 

C.1. Explicit form of h(Z)

We recall that $h(Z)$ is the cost of clustering all the video
streams $\{x_n\}$, $n = 1, \ldots, N$, into a sequence of $K$ steps. The
design matrix $X \in \mathbb{R}^{T \times d}$ contains the features describing the time
intervals in our videos. The indicator latent variable $Z \in \mathcal{Z} :=
\{0,1\}^{T \times K}$ encodes the visual presence of a step $k$ at a time
interval $t$. Recall also that $X$ and $Z$ contain the information about all
videos $n \in \{1, \ldots, N\}$. Finally, $W \in \mathbb{R}^{d \times K}$ represents a linear
classifier for our $K$ steps, that is shared among all videos. We now
derive the explicit form of $h(Z)$ as in the DIFFRAC approach [2],
though yielding a somewhat simpler expression (as in [5]) due to
our use of a (weakly regularized) bias feature in $X$ instead of a
separate (unregularized) bias $b$. Consider the following joint cost
function $f$ on $Z$ and $W$ defined as

$$ f(Z, W) = \frac{1}{2T}\|Z - XW\|_F^2 + \frac{\lambda}{2}\|W\|_F^2. \tag{9} $$

The cost function $f$ simply represents the ridge regression
objective with output labels $Z$ and input design matrix $X$. We note
that $f$ has the nice property of being jointly convex in both $Z$ and
$W$, implying that its unrestricted minimization with respect to $W$
yields a convex function in $Z$. This minimization defines our
clustering cost $h(Z)$; rewriting the definition of $h$ with the joint cost $f$
from (9), we have:

$$ h(Z) = \min_{W \in \mathbb{R}^{d \times K}} f(Z, W). \tag{10} $$

As $f$ is strongly convex in $W$ (for any $Z$), we can obtain its unique
minimizer $W^*(Z)$ as a function of $Z$ by zeroing its gradient and
solving for $W$. For the case of the square loss in equation (9), the
optimal classifier $W^*(Z)$ can be computed in closed form:

$$ W^*(Z) = (X^T X + T\lambda I_d)^{-1} X^T Z, \tag{11} $$

where $I_d$ is the $d$-dimensional identity matrix. We obtain the
explicit form for $h(Z)$ by substituting the expression (11) for
$W^*(Z)$ in equation (9) and properly simplifying the expression:

$$ h(Z) = f(Z, W^*(Z)) = \frac{1}{2T}\operatorname{Tr}(Z Z^T B), \tag{12} $$


where $B := I_T - X(X^T X + T\lambda I_d)^{-1} X^T$ is a strictly positive
definite matrix (and so $h$ is actually strongly convex). The
clustering cost is a quadratic function in $Z$, encoding how the clustering
decisions in one interval $t$ interact with the clustering decisions in
another interval $t'$. In the next section, we explain how we can
optimize the clustering cost $h(Z)$ subject to the constraints from
Section 4.2 using the Frank-Wolfe algorithm.
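
For completeness, the simplification leading to (12) can be spelled out. The derivation below is a sketch that assumes the ridge objective has the form $f(Z, W) = \frac{1}{2T}\|Z - XW\|_F^2 + \frac{\lambda}{2}\|W\|_F^2$ and uses only the optimality condition $(X^T X + T\lambda I_d) W^* = X^T Z$ implied by (11).

```latex
% Assumption: f(Z,W) = \frac{1}{2T}\|Z - XW\|_F^2 + \frac{\lambda}{2}\|W\|_F^2,
% and W^* satisfies (X^T X + T\lambda I_d) W^* = X^T Z.
\begin{aligned}
f(Z, W^*)
&= \tfrac{1}{2T}\Big[\operatorname{Tr}(Z^T Z) - 2\operatorname{Tr}(Z^T X W^*)
   + \operatorname{Tr}\big(W^{*T}(X^T X + T\lambda I_d) W^*\big)\Big] \\
&= \tfrac{1}{2T}\Big[\operatorname{Tr}(Z^T Z) - 2\operatorname{Tr}(Z^T X W^*)
   + \operatorname{Tr}(W^{*T} X^T Z)\Big] \\
&= \tfrac{1}{2T}\Big[\operatorname{Tr}(Z^T Z) - \operatorname{Tr}(Z^T X W^*)\Big] \\
&= \tfrac{1}{2T}\operatorname{Tr}\Big(Z^T \big(I_T - X (X^T X + T\lambda I_d)^{-1} X^T\big) Z\Big)
 = \tfrac{1}{2T}\operatorname{Tr}(Z Z^T B).
\end{aligned}
```

The first line expands both squared norms, the second uses the optimality condition, the third uses $\operatorname{Tr}(W^{*T} X^T Z) = \operatorname{Tr}(Z^T X W^*)$, and the last substitutes the closed form of $W^*$ and applies the cyclic property of the trace.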

C.2. Frank-Wolfe algorithm for minimizing h(Z)

The localization of steps in the video stream is done by solving
the following optimization problem (repeated from (2) here for
convenience):

$$ \underset{Z}{\text{minimize}} \;\; h(Z) \quad \text{s.t.} \quad \underbrace{Z \in \mathcal{Z}}_{\text{ordered script}}, \quad \underbrace{AZ \ge R}_{\text{weak textual constraints}}, \tag{13} $$

where $Z$ is the latent assignment matrix of video time intervals
to $K$ clusters and $R$ is the matrix of assignments of direct object
relations in text to $K$ clusters. Note that $R$ is obtained from the
text clustering using multiple sequence alignment as described in
Sections 4.1 and B.1, and is fixed before optimizing over $Z$. $R$ is an
$S \times K$ matrix obtained by picking the $K$ main columns of the $U$
matrix defined in Section B.1. This selection step was described
in the “extracting the main steps” paragraph in Section 4.1.

The constraint set encodes several concepts. First, it imposes
the temporal consistency between the text stream and the video
stream. We recall that this constraint was written as $AZ \ge R$,¹¹
where $A$ encodes the temporal alignment constraints between
video and text (type I). Second, it includes the event ordering
constraints within each video input (type II). Finally, it encodes the
fact that each event is assigned to exactly one time interval within
each video (type III). The last two constraints are encoded in the
set of constraints $\mathcal{Z}$. To summarize, let $\widetilde{\mathcal{Z}}$ denote the resulting
(discrete) feasible space for $Z$, i.e. $\widetilde{\mathcal{Z}} := \{Z \in \mathcal{Z} \mid AZ \ge R\}$.

We are then left with a problem in $Z$ which is still hard to
solve because the set $\widetilde{\mathcal{Z}}$ is not convex. To approximately optimize
$h$ over $\widetilde{\mathcal{Z}}$, we follow the strategy of [4, 5]. First, we optimize $h$
over the relaxed $\operatorname{conv}(\widetilde{\mathcal{Z}})$ by using the Frank-Wolfe algorithm to
get a fractional solution $Z^* \in \operatorname{conv}(\widetilde{\mathcal{Z}})$. We then find a feasible
candidate $Z \in \widetilde{\mathcal{Z}}$ by using a rounding procedure. We now give the
details of these steps.

First we note that the linear oracle of the Frank-Wolfe
algorithm can be solved separately for each video $n$. Indeed, because
we solve a linear program, there is no quadratic term that brings
dependence between different videos in the objective, and
moreover all the constraints are blockwise in $n$. Thus, in the following,

11 When $R_{sk} = 0$, this constraint does not do anything. When
$R_{sk} = 1$ (i.e. the text token $s$ was assigned to the main action $k$), the
constraint enforces that $\sum_t A_{st} Z_{tk} \ge 1$, where $A_{s\cdot}$ represents which
video frames are temporally close to the caption time of the text token $s$. It
thus enforces that at least one temporally close video frame is assigned
to the main action $k$.

Figure 5: Illustration of the dynamic programming solution
to the linear program (14). The drawing shows a possible
cost matrix $\widetilde{C}$ and an optimal path in red. The gray entries
in the matrix $\widetilde{C}$ correspond to the values from the matrix
$C$. The white entries have minimal cost and are thus always
preferred over any gray entry. Note that we display $\widetilde{C}$ in a
transposed manner to better fit on the page.

we will give details for one video only by adding an index $n$ to $Z$,
to $\mathcal{Z}$ and to $T$.

The linear oracle of the Frank-Wolfe algorithm can be solved
via an efficient dynamic program. Let us suppose that the linear
oracle corresponds to the following problem:

$$ \min_{Z_n \in \mathcal{Z}_n} \operatorname{Tr}(C_n^T Z_n), \tag{14} $$

where $C_n \in \mathbb{R}^{T_n \times K}$ is a cost matrix that arises by computing the
gradient of $h$ with respect to $Z_n$ at the current iterate. The goal of
the dynamic program is to find which entries of $Z_n$ are equal to 1,
recalling that $(Z_n)_{tk} = 1$ means that the step $k$ was assigned to
time interval $t$. From the constraint of type III (unique prediction
per step), we know that each column $k$ of $Z_n$ has exactly one 1
(to be found). From the ordering constraint (type II), we know
that if $(Z_n)_{tk} = 1$, then the only possible locations for a 1 in
the $(k+1)$-th column are for $t' > t$ (i.e. the pattern of 1's is going
downward when traveling from left to right in $Z_n$). Note that there
can be “jumps” in between the time assignment for two subsequent
steps $k$ and $k+1$. In order to encode this possibility using a
continuous path search in a matrix, we insert dummy columns into
the cost matrix $C$. We first subtract the minimum value from $C$
and then insert columns filled with zeros in between every pair of
columns of $C$. In the end, we pad $C$ with an additional row filled
with zeros at the bottom. The resulting cost matrix $\widetilde{C}$ is of size
$(T_n + 1) \times (2K + 1)$ and is illustrated (as its transpose) along
with the corresponding update rules in Figure 5.

The problem that we are interested in is subject to the
additional linear constraints given by the clustering of text transcripts
(constraints of type I). These constraints can be added by
constraining the path in the dynamic programming algorithm. This can be
done for instance by setting an infinite alignment cost outside of
the constrained region.
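
The ordered selection problem solved by the linear oracle can be sketched with a direct recursion. This is not the dummy-column path search of Figure 5: it is a simpler dynamic program for the same problem (one time interval per step, strictly increasing in time), with an optional mask that plays the role of the infinite cost outside the constrained region. The function name and the toy cost matrix are hypothetical.

```python
import math

def ordered_assignment(C, allowed=None):
    """
    C[t][k]: cost of assigning step k to time interval t (T rows, K columns).
    allowed[t][k] (optional): False forbids the assignment (infinite cost).
    Returns (cost, times) with times[k] strictly increasing in k,
    or (inf, None) if no feasible assignment exists.
    """
    T, K = len(C), len(C[0])

    def cost(t, k):
        if allowed is not None and not allowed[t][k]:
            return math.inf
        return C[t][k]

    # best[k][t]: minimal cost of placing steps 0..k with step k at time t.
    best = [[math.inf] * T for _ in range(K)]
    back = [[-1] * T for _ in range(K)]
    for t in range(T):
        best[0][t] = cost(t, 0)
    for k in range(1, K):
        run_min, run_arg = math.inf, -1  # running min of best[k-1][t'] over t' < t
        for t in range(T):
            if t > 0 and best[k - 1][t - 1] < run_min:
                run_min, run_arg = best[k - 1][t - 1], t - 1
            if run_arg >= 0:
                best[k][t] = run_min + cost(t, k)
                back[k][t] = run_arg
    t = min(range(T), key=lambda i: best[K - 1][i])
    total = best[K - 1][t]
    if math.isinf(total):
        return math.inf, None
    times = [t]
    for k in range(K - 1, 0, -1):  # follow back-pointers to recover the path
        t = back[k][t]
        times.append(t)
    return total, times[::-1]

C = [
    [1, 9, 9],
    [0, 2, 9],
    [5, 0, 9],
    [9, 4, 1],
    [9, 9, 3],
]
total, times = ordered_assignment(C)

# Forbid step 1 at time 2, as a type I constraint might.
allowed = [[True] * 3 for _ in range(5)]
allowed[2][1] = False
masked_total, masked_times = ordered_assignment(C, allowed)
```

The running minimum keeps the recursion linear in the number of time intervals for each step, so the oracle costs $O(T_n K)$ per video in this formulation.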

At the end of the Frank-Wolfe optimization algorithm, we
obtain a continuous solution $Z_n^*$ for each $n$. By stacking them all
together again, we obtain a continuous solution $Z^*$. From the
definition of $h$, we can also look at the corresponding model $W^*(Z^*)$
defined by equation (11), which again is shared among all videos.
All $Z_n^*$ have to be rounded in order to obtain a feasible point for
the initial, non-relaxed problem. Several rounding options were
suggested in [5]; it turns out that the one which uses $W^*$ gives
better results in our case. More precisely, in order to get a good
feasible binary matrix $Z_n \in \mathcal{Z}_n$, we solve the following problem:
$\min_{Z_n \in \mathcal{Z}_n} \|Z_n - X_n W^*\|_F^2$. By expanding the norm, we
notice that this corresponds to a simple linear program over $Z_n$ as
in equation (14) that can be solved using again the same dynamic
program detailed above. Finally, we stack these rounded matrices
$Z_n$ to obtain our predicted assignment matrix $Z \in \mathcal{Z}$.
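
The reduction of the rounding problem to a linear program can be written out explicitly. The expansion below is a sketch that relies on the type III constraint (each column of $Z_n$ contains exactly one 1), so that $\|Z_n\|_F^2 = K$ is a constant.

```latex
% Assumption: Z_n is binary with exactly one 1 per column, hence \|Z_n\|_F^2 = K.
\begin{aligned}
\|Z_n - X_n W^*\|_F^2
&= \|Z_n\|_F^2 - 2\operatorname{Tr}\big(Z_n^T X_n W^*\big) + \|X_n W^*\|_F^2 \\
&= K - 2\operatorname{Tr}\big((X_n W^*)^T Z_n\big) + \|X_n W^*\|_F^2 .
\end{aligned}
```

Only the middle term depends on $Z_n$, so the rounding problem has the same form as the linear oracle (14) with the cost matrix $C_n = -2\, X_n W^*$.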

D. Experimental protocol 

In this section, we give more details about the setting for our 
experiments on the time localization of events with results given 
in Figure 3. 

D.1. Supervised experiments.

Here, we describe in more detail how we obtained the scores
for the supervised approach depicted in yellow in Figure 3. We
first divided the $N$ input videos into 5 different folds. One fold is
kept for the test set while the 4 others are used as the train/validation
dataset. With the 4 remaining folds, we perform a 4-fold cross
validation in order to choose the hyperparameter $\lambda$. Once the
hyperparameter is fixed, we retrain a model on the 4 folds and evaluate
it on the test set. By iterating over the five possible test folds, we
report variation in performance with error bars in Figure 3.
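
The nested protocol can be sketched as follows. The round-robin fold assignment and the function names are hypothetical choices made for illustration; the text above does not specify how videos are distributed across folds.

```python
def k_folds(items, k):
    """Split items into k disjoint folds of near-equal size (round-robin)."""
    return [items[i::k] for i in range(k)]

def nested_cv_splits(video_ids, n_outer=5):
    """Yield (train_val_folds, test_fold) for each choice of test fold."""
    folds = k_folds(list(video_ids), n_outer)
    for i, test in enumerate(folds):
        train_val = [f for j, f in enumerate(folds) if j != i]
        yield train_val, test

def inner_splits(train_val_folds):
    """Cross-validation over the remaining folds, used to select the hyperparameter."""
    for i, val in enumerate(train_val_folds):
        train = [v for j, f in enumerate(train_val_folds) if j != i for v in f]
        yield train, val

# 30 toy video identifiers: 5 outer splits, each with a 4-fold inner loop.
splits = list(nested_cv_splits(range(30)))
```

The test fold never enters the inner loop, so the hyperparameter is selected without looking at the videos used for the reported score.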

Training phase. The goal of this phase is to learn classifiers
$W$ for the visual steps. To that end, we minimize the cost defined
in (2) under the ground truth annotation constraints. This is very
close to our setting, and in practice we can use exactly the same
framework as in problem (13) by simply replacing the constraints
coming from the text by the constraints coming from the ground
truth annotations.

Testing phase. At test time, we simply use the classifiers $W$ to
perform least-squares prediction of $Z_{\text{test}}$ under ordering constraints.
Performance is evaluated with the F1 score.

D.2. Error bars for Frank-Wolfe methods. 

We explain here how we obtained the error bars of Figure 3 
in the main paper for the unsupervised approaches. Let us first 
recall that the Frank-Wolfe algorithm is used to solve a continuous 
relaxation of problem (13). To recover an integer solution, 
we round the continuous solution using the rounding method 
described at the end of Section C.2. This rounding procedure is 
performed at each iteration of the optimization method. When 
the stopping criterion of the Frank-Wolfe scheme is reached (a fixed 
number of iterations or a target sub-optimality in practice), we have 
as many rounded solutions as iterations. Our output 
integer solution is then the integer point that achieves the lowest 
objective. Note that we are only guaranteed to decrease the objective 
in the continuous domain and not at the integer points; therefore, 
there is no guarantee that this solution is the last rounded point. 
In order to illustrate the variation of the performance with respect 
to the optimization scheme, we defined our error bars as the 
interval with bounds determined by the minimal performance and 
the maximal performance obtained after visiting the best rounded 
point (the output solution). This notably explains why the error 
bars of Figure 3 are not necessarily symmetric. Overall, the observed 
variation is small, thus highlighting the stability 
of the procedure. 

Figure 6: Illustration of our newly collected dataset of instruction videos. Examples of transcribed narrations together with still frames 
from the corresponding videos are shown for the 5 tasks of the dataset: Repotting a plant, Performing CPR, Jumping cars, Changing a car 
tire and Making coffee. The dataset contains challenging real-world videos performed by many different people, captured in uncontrolled 
settings in a variety of outdoor and indoor environments. 
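
The rounding-and-selection procedure of Section D.2 can be sketched on a toy problem. Everything specific below is an assumption made for illustration: the feasible set is a probability simplex rather than the paper's constraint set, rounding is a plain argmax rather than the method of Section C.2, the step size is the standard 2/(t+2) schedule, and `score` is a hypothetical performance measure. The error bar is computed under one reading of the text, namely the minimum and maximum performance of the rounded points from the best one onward.

```python
def frank_wolfe_with_rounding(grad, lmo, round_fn, objective, x0, n_iters=20):
    """Run Frank-Wolfe, round the iterate at every iteration, and return
    the index of the rounded point with the lowest objective together
    with the full history of (objective, rounded point) pairs."""
    x = list(x0)
    history = []
    for t in range(n_iters):
        s = lmo(grad(x))                      # linear minimization oracle
        gamma = 2.0 / (t + 2.0)               # standard step size (assumption)
        x = [(1.0 - gamma) * xi + gamma * si for xi, si in zip(x, s)]
        z = round_fn(x)                       # integer point for this iterate
        history.append((objective(z), z))
    best_iter = min(range(n_iters), key=lambda i: history[i][0])
    return best_iter, history


def error_bar(history, best_iter, score):
    """Min and max performance of rounded points from the best one onward."""
    tail = [score(z) for _, z in history[best_iter:]]
    return min(tail), max(tail)


# Toy problem (illustrative only): minimize 0.5 * ||x - c||^2 over the simplex.
c = [0.1, 0.7, 0.2]


def one_hot(j, n):
    return [1.0 if i == j else 0.0 for i in range(n)]


def objective(x):
    return 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def grad(x):
    return [xi - ci for xi, ci in zip(x, c)]


def lmo(g):
    # The minimizer of a linear function over the simplex is a vertex.
    return one_hot(min(range(len(g)), key=g.__getitem__), len(g))


def round_fn(x):
    # Stand-in rounding: nearest vertex of the simplex.
    return one_hot(max(range(len(x)), key=x.__getitem__), len(x))


best_iter, history = frank_wolfe_with_rounding(
    grad, lmo, round_fn, objective, [1.0 / 3.0] * 3
)
best_objective, best_z = history[best_iter]
low, high = error_bar(
    history, best_iter, score=lambda z: float(z == [0.0, 1.0, 0.0])
)
```

In this toy run the objective of the rounded points is not monotone across iterations, which mirrors the remark that the best integer point need not be the last one.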

E. Qualitative results 

In Section E.1, we give detailed results of script discovery for 
the five different tasks. In Section E.2, we present detailed results 
for the action localization experiment. 

E.1. Script discovery 

Table 4 shows the automatically recovered sequences of steps 
for the five tasks considered in this work. The results are shown 
for several settings of the maximum number of discovered steps, K ∈ 
{7, 10, 12, 15}. Note how our method automatically selects fewer 
than K steps in some cases. These are the automatically chosen 
k ≤ K steps that are the most salient in the aligned narrations, as 
described in Section 4.1. This is notably the case for the Repotting 
a plant task. Even for K ≤ 12, the algorithm recovers only 
6 steps, which match very well the seven ground truth steps for this 
task. This saliency-based step selection is important because it allows 
for better precision at high K without lowering the recall much. 

Note also how the steps and their ordering recovered by 
our method correspond well to the ground truth steps for each task. 
For CPR, our method recovers fine-grained steps, e.g., tilt head, lift 
chin, which are not included in the main ground truth steps but 
could nevertheless be helpful in some situations. For Changing 
tire, we also recover more detailed actions such as remove jack 
or put jack. In some cases, our method recovers repeated steps. 
For example, for CPR our method learns that one has to alternate 
between giving breath and performing compressions, even if this 
alternation was not annotated in the ground truth. Similarly, for Jumping 
cars our method learns that cables need to be connected twice 
(to both cars). 

These results demonstrate that our method is able to automatically 
discover meaningful scripts describing very different tasks. 
The results also show that the constraint of a single script providing 
an ordering of events is a reasonable prior for a variety of 
different tasks. 

E.2. Action localization 

Examples of the recovered instruction steps for all five tasks are 
shown in Figures 7-11. Each row shows one recovered step. For 
each step, we first show the clustered direct object relations, followed 
by representative example frames localizing the step in the 
videos. Correct localizations are shown in green. Some steps are 
incorrectly localized in some videos (red), but these frames often look visually 
very similar to the correct ones. Note how our method correctly recovers the main 
steps of the task and localizes them in the input videos. These results 
have been obtained by imposing K ≤ 10 in our method. The 
video on the project website illustrates action localization for the 
five tasks. 

GT (11)           K ≤ 7             K ≤ 10            K ≤ 12            K ≤ 15
put brake on
get tools out                       get tire          get tire          get tire
start loose       loosen nut        loosen nut        loosen nut        loosen nut
                                                                        lift car
                  put jack          put jack          put jack          put jack
                                                      raise vehicle     raise vehicle
jack car          jack car          jack car          jack car          jack car
unscrew wheel     remove nut        remove nut        remove nut        remove nut
remove wheel                        take wheel        take wheel        take wheel
put wheel         take tire         take tire         take tire         take tire
screw wheel                         put nut           put nut           put nut
lower car         lower jack        lower jack        lower jack        lower jack
                                                                        remove jack
tight wheel       tighten nut       tighten nut       tighten nut       tighten nut
put things back                                       take tire         take tire
Precision         0.85              0.9               0.83              0.71
Recall            0.54              0.9               0.9               0.9

(a) Changing a tire 

GT (10)           K ≤ 7             K ≤ 10            K ≤ 12            K ≤ 15
grind coffee
put filter
add coffee                          put coffee        put coffee        put coffee
even surface                        fill chamber      fill chamber      fill chamber
                                                                        make noise
fill water        fill water        fill water        fill water        fill water
screw top                           put filter        put filter        put filter
                                                                        fill basket
                                    see steam         see steam         see steam
put stove         take minutes      take minutes      take minutes      take minutes
                  make coffee       make coffee       make coffee       make coffee
see coffee        see coffee        see coffee        see coffee        see coffee
withdraw stove                                                          turn heat
pour coffee       make cup          make cup          make cup          make cup
                                                                        pour coffee
Precision         0.8               0.67              0.67              0.54
Recall            0.4               0.6               0.6               0.7

(b) Making coffee 

GT (7)            K ≤ 7             K ≤ 10            K ≤ 12            K ≤ 15
cover hole                                                              take piece
                                                                        keep soil
                                                                        stop soil
take plant        take plant        take plant        take plant        take plant
put soil          use soil          use soil          use soil          use soil
loosen root       loosen soil       loosen soil       loosen soil       loosen soil
place plant       place plant       place plant       place plant       place plant
add top           add soil          add soil          add soil          add soil
                                                                        fill pot
                                                                        get soil
                                                                        give drink
water plant       water plant       water plant       water plant       water plant
                                                                        give watering
Precision         1                 1                 1                 0.54
Recall            0.86              0.86              0.86

(c) Repot a plant 

GT (7)            K ≤ 7             K ≤ 10            K ≤ 12            K ≤ 15
open airway       open airway       open airway       open airway       open airway
check response
call 911
check breathing                     put hand          put hand          put hand
check pulse
                  tilt head         tilt head         tilt head         tilt head
                  lift chin         lift chin         lift chin         lift chin
give breath       give breath       give breath       give breath       give breath
give compression  do compr.         do compr.         do compr.         do compr.
                  open airway       open airway       open airway       open airway
                                    start compr.      start compr.      start compr.
                                                                        continue cpr
                                    do compr.         do compr.         do compr.
                                                                        put hand
                                    give breath       give breath       give breath
Precision         0.5               0.4               0.4               0.33
Recall            0.43              0.57              0.57              0.57

(d) Performing CPR 

GT (12)           K ≤ 7             K ≤ 10            K ≤ 12            K ≤ 15
get cars                                                                have terminal
open hood
                                                      attach cab.       attach cab.
connect red A     connect cable     conn. cable       conn. cable       conn. cable
connect red B     charge battery    charge batt.      charge batt.      conn. clamp
                  connect end       conn. end         conn. end         charge batt.
                                                                        conn. end
connect black A                                       conn. cab.        conn. cab.
connect ground                                        have cab.         have cab.
start car A       start car         start car         start car         start car
start car B                                           start vehicle     start veh.
remove ground     remove cable      rem. cable        start engine      start eng.
                                                      rem. cable        rem. cable
remove black A    disconnect cable  disc. cable       disc. cable       disc. cable
remove red B
remove red A
Precision         0.83              0.83              0.72              0.69
Recall            0.42              0.42              0.67              0.67

(e) Jumping cars 

Table 4: Automatically recovered sequences of steps for the five tasks considered in this work. Each recovered step is represented by 
one of the aligned direct object relations (shown in bold). Note that most of the recovered steps correspond well to the ground truth steps 
(shown in italic). The results are shown for several settings of the maximum number of discovered steps, K ∈ {7, 10, 12, 15}. Note how our method 
automatically selects fewer than K steps in some cases. These are the automatically chosen k ≤ K steps that are the most salient in the 
aligned narrations, as described in Sec. 4.1. 

Figure 7: Examples of the recovered instruction steps for the task “Changing the car tire”. 

Figure 8: Qualitative results for the task “Jumping cars”. 

Figure 9: Qualitative results for the task “Repot a plant”. 

Figure 10: Qualitative results for the task “Making coffee”. 

Figure 11: Qualitative results for the task “Performing CPR”. 